[Gupta\* *et al.*, 5.(5): May, 2016] IC<sup>TM</sup> Value: 3.00



# INTERNATIONAL JOURNAL OF ENGINEERING SCIENCES & RESEARCH TECHNOLOGY

ISSN: 2277-9655

**Impact Factor: 3.785** 

## LOW POWER HIGH PERFORMANCE ANALYSIS FOR 64 BIT ARITHMETICAL LOGICAL UNIT

## Shikha Gupta\*, Mrs Jigyasha Maru

\*Department of Electronics and Telecommunication, CVRAMAN Kota, C.G, India Department of Electronics and Telecommunication, CVRAMAN Kota, C.G, India

**DOI**: 10.5281/zenodo.52497

#### **ABSTRACT**

We are in the age of Internet of Things (IoT) technology, where every device is controlled through web or mobile applications. Such technology needs fast systems that process data as quickly as possible with low battery consumption. The core of every embedded device is its processor, which in turn uses the ALU as its workhorse. If the workhorse needs less power, delay, and area, the complete system will do justice to the SPAA metrics (Speed, Power, Area and Accuracy). This work proposes an architecture for a 64-bit general-purpose ALU. Critical power dissipation is avoided by applying clock gating to the required hardware and by an improved architectural approach in which the ALU is divided into four sub-blocks of 16 bits each. The 64-bit ALU identifies the input bits and performs the operation accordingly, which saves power. The architecture is implemented in a hardware description language (Verilog) and synthesized, and the analysis is performed at the FPGA (Field Programmable Gate Array) level.

KEYWORDS: ALU, BIT, FPGA, VLSI, POWER, SPEED, AREA.

#### INTRODUCTION

An ALU is a combinational logic circuit, meaning that its outputs will change asynchronously in response to input changes. In normal operation, stable signals are applied to all of the ALU inputs and, when enough time (known as the "propagation delay") has passed for the signals to propagate through the ALU circuitry, the result of the ALU operation appears at the ALU outputs. The external circuitry connected to the ALU is responsible for ensuring the stability of ALU input signals throughout the operation, and for allowing sufficient time for the signals to propagate through the ALU before sampling the ALU result. In general, external circuitry controls an ALU by applying signals to its inputs. Typically, the external circuitry employs sequential logic to control the ALU operation, which is paced by a clock signal of a sufficiently low frequency to ensure enough time for the ALU outputs to settle under worst-case conditions. For example, a CPU begins an ALU addition operation by routing operands from their sources (which are usually registers) to the ALU's operand inputs, while the control unit simultaneously applies a value to the ALU's opcode input, configuring it to perform addition. At the same time, the CPU also routes the ALU result output to a destination register that will receive the sum. The ALU's input signals, which are held stable until the next clock, are allowed to propagate through the ALU and to the destination register while the CPU waits for the next clock. When the next clock arrives, the destination register stores the ALU result and, since the ALU operation has completed, the ALU inputs may be set up for the next ALU operation.
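The set-up, propagate, and capture sequence described above can be sketched as a small behavioural model. This is only an illustration in Python, not the Verilog used in this work; the names `alu`, `Register`, and `run_add`, the two opcodes, and the 64-bit mask are assumptions made for the example.

```python
MASK64 = (1 << 64) - 1  # assumed 64-bit datapath width


def alu(opcode, a, b):
    """Purely combinational: the output depends only on the current inputs."""
    if opcode == "ADD":
        return (a + b) & MASK64
    if opcode == "AND":
        return a & b
    raise ValueError(f"unsupported opcode: {opcode}")


class Register:
    """Edge-triggered register: the stored value changes only on a clock edge."""

    def __init__(self):
        self.q = 0  # value currently stored
        self.d = 0  # value waiting at the data input

    def clock_edge(self):
        self.q = self.d


def run_add(x, y):
    """Return the destination register contents before and after the clock edge."""
    dest = Register()
    # Between clock edges the operands and opcode are held stable, and the
    # ALU result propagates to the data input of the destination register.
    dest.d = alu("ADD", x, y)
    before = dest.q    # the result has not been captured yet
    dest.clock_edge()  # the next clock edge stores the settled result
    return before, dest.q
```

Calling `run_add(2, 3)` returns the register contents before the edge (still the reset value) and after it (the sum), which mirrors the point that the result only becomes visible in the destination register once the next clock arrives.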

Owing to the widespread use of microprocessors and signal processors, the implementation of high-performance arithmetic hardware has always remained an attractive design problem. The Arithmetic and Logic Unit (ALU) is the workhorse of a microprocessor and determines the processor's speed of operation. All modern processors include standalone hardware for computing basic arithmetic operations. In addition to fast arithmetic hardware, processors are also equipped with on-chip memory (cache), which improves performance significantly by avoiding the delay of data accesses to main memory.




An ALU has a variety of input and output nets, which are the shared electrical connections used to convey digital signals between the ALU and external circuitry. When an ALU is operating, external circuits apply signals to the ALU inputs and, in response, the ALU produces and conveys signals to external circuitry via its outputs. The ALU is essentially the heart of a CPU and is also widely used in DSPs and microprocessors. In the past, VLSI designers concentrated mainly on area, performance, cost, and reliability, and gave the least importance to power. Nowadays power is given higher priority than area and speed. The two low-power logic styles used in ALUs are CMOS logic and pass-transistor logic (PTL). At present, most portable devices are battery operated, and the ALU is the brain of the whole system: if the brain requires high power, area, and latency, so does the complete system.

Designers give more importance to power than to speed because high-performance systems have a reliability problem. Such systems often run hot, and high temperature tends to exacerbate several silicon failure mechanisms. As a rule of thumb, every 10 degrees Celsius increase in operating temperature roughly doubles a component's failure rate. From an environmental point of view, the smaller the power dissipation of electronic systems, the lower the heat pumped into rooms, the lower the electricity consumed, and hence the lower the impact on the global environment. There is always a trade-off between power, area, and delay; depending on the requirements, the designer selects suitable low-power logic techniques.

The arithmetic logic unit (ALU) is the core of a CPU in a computer. The adder cell is the elementary unit of an ALU. The constraints the adder has to satisfy are area, power and speed requirements.

## LITERATURE REVIEW

**Andrew 1951[1]**: Computer engineers have never stopped trying to improve system performance by optimizing arithmetic units. In 1951, Booth presented a signed binary recoding scheme that reduces the number of partial products generated from the multiplier; the Booth approach was improved in later work.
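A minimal sketch of the recoding idea follows. It uses the basic radix-2 form of Booth recoding, in which each digit is the difference between a multiplier bit and its right-hand neighbour, so a run of ones collapses to two non-zero digits. The function names and the `bits` parameter are assumptions made for this illustration, and how many partial products are saved depends on the bit pattern of the multiplier.

```python
def booth_recode(multiplier, bits):
    """Radix-2 Booth digits (least significant first), each in {-1, 0, +1},
    for a two's-complement multiplier that is `bits` wide."""
    digits = []
    prev = 0  # implicit bit to the right of the least significant bit
    for i in range(bits):
        cur = (multiplier >> i) & 1
        digits.append(prev - cur)
        prev = cur
    return digits


def booth_multiply(multiplicand, multiplier, bits):
    """Sum only the partial products selected by non-zero Booth digits."""
    product = 0
    for i, digit in enumerate(booth_recode(multiplier, bits)):
        if digit:  # zero digits need no partial product
            product += digit * (multiplicand << i)
    return product
```

For the 4-bit multiplier `0b0111`, plain binary multiplication needs three partial products, whereas the recoded digits contain only two non-zero entries (one subtraction and one addition).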

**Dimitri 2003[13]:** The authors proposed a 64-bit multiply-accumulator (MAC) that can compute one 64x64, two 32x32, four 16x16, or eight 8x8 unsigned/signed multiply-accumulations using shared segmentation.

**Ruchir 2006[19]:** In more recent work on arithmetic logic units, the authors of [19] present an ALU that contains two sub-modules, a lower-bound module and an upper-bound module. The design performs addition, subtraction, multiplication, and the set operation of union; division is performed by shifting. The lower-bound or upper-bound module is selected by flag generation. The drawbacks of this approach are that hardware size and power consumption increase and that division relies on shifting only.

**Akkas 2008[14]:** Akkas presented architectures for dual-mode adders and multipliers in floating point [14, 15]. The paper [14] presents dual-mode floating-point adder architectures that support one higher-precision addition or two parallel lower-precision additions. A double-precision floating-point adder implemented with the improved single-path algorithm is modified into a dual-mode double-precision floating-point adder that supports both one double-precision addition and two parallel single-precision additions. Isseven presented a dual-mode floating-point divider [16] that supports two parallel double-precision divisions or one quadruple-precision division. In [17], Huang presents a three-mode (32-, 64-, and 128-bit mode) floating-point fused multiply-add (FMA) unit with SIMD support. All of the above multimode multiple-precision structures support only a few predefined precisions. To the best of our knowledge, our proposed architecture is the first true dynamic-precision architecture targeting both fixed-point and floating-point ALUs.

**Zhou 2008[21]:** The authors of [21] proposed an approach with two types of ALU structure: a tree and a chain. The tree structure requires less area than other designs. The chain structure has low latency and simple control of operations. The approach can be easily integrated into a processor design environment. The problem with this approach is that there is no control signal, so whenever two inputs A and B arrive they are processed by every computation unit. The chain structure requires extra hardware and more input and output pins, while in the tree structure latency increases depending on the operation.




**Abhishek 2012[18]:** This paper is devoted to the design of a speed-, energy-, and power-efficient arithmetic logic unit. The speed of an ALU depends greatly on the speed of its multiplication unit, and many multiplication techniques have been devised at the algorithmic and structural levels. After thorough study and analysis, the authors conclude that the Vedic Urdhva Tiryagbhyam multiplication algorithm is the best choice because it generates partial products in parallel. They propose a new architecture based on a tree multiplication structure to implement this Vedic multiplier, and use a divide-and-conquer approach to generate the partial products.
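The "vertical and crosswise" idea behind this multiplier can be sketched as follows. Every output column collects the bit products whose indices sum to the column number, and the columns are independent of each other, which is what allows hardware to form them in parallel. This is a behavioural illustration only: the function name and the `bits` parameter are assumptions, and the sketch does not reproduce the tree structure proposed in [18].

```python
def urdhva_multiply(a, b, bits):
    """Multiply two unsigned `bits`-wide integers column by column."""
    a_bits = [(a >> i) & 1 for i in range(bits)]
    b_bits = [(b >> i) & 1 for i in range(bits)]

    # Column k collects every crosswise product a_i * b_j with i + j == k.
    # The columns do not depend on each other, so hardware can form them
    # at the same time.
    columns = [
        sum(a_bits[i] & b_bits[k - i] for i in range(bits) if 0 <= k - i < bits)
        for k in range(2 * bits - 1)
    ]

    # Resolve carries by adding each column sum at its binary weight.
    result = 0
    for k, column_sum in enumerate(columns):
        result += column_sum << k
    return result
```

For example, `urdhva_multiply(13, 11, 4)` forms seven column sums from the 4-bit operands before combining them into the final product.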


**Josip 2012[20]:** The authors of [20] proposed an ALU for an 8-bit microcontroller. The proposed ALU contains three sub-modules: ARITHMETIC, LOGIC, and BIT operation. The design performs arithmetic, logical, and shift or rotate operations on fixed-point and floating-point data. All sub-modules are connected to a multiplexer, which selects the output according to a control signal. The problem with this approach is that the arithmetic unit is based on a cascade of full adders, which increases latency; similarly, the shifters in the logic unit are connected in cascade. The instruction set is also limited: only 15 operations are performed.

**Getao 2012[22]:** The authors of [22] proposed bit-width-aware control (ctrl) and carry-out (carry) signals for a design with multiple dynamic precision (DP) operations. The problem with this approach is that it requires an extra hardware unit for dynamic precision. Another recently developed technique is approximation [23, 24], which targets the many applications that are error tolerant, meaning that the errors introduced cannot be perceived by the human eye. Similarly, for power reduction there is a useful technique known as clock gating [25], which reduces the dynamic power of the whole system.

**Jagrit 2013[25]:** The clock network in a microprocessor feeds clock to sequential elements like flip-flops and latches, and to dynamic logic gates, which are used in high-performance execution units and array address decoders (e.g. *D*-cache word-line decoder). At a high level, gating the clock to a latch or a logic gate by ANDing the clock with a control signal prevents the unnecessary charging/discharging of the capacitances when the circuit is idle, and saves the circuit's clock power.
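The AND-based gating described above can be modelled behaviourally as shown below. The class and function names are assumptions made for this sketch, and the `clock_toggles` counter is only a rough stand-in for clock power: it counts how often the clock actually reaches the register.

```python
class GatedRegister:
    """Register whose clock is ANDed with an enable (control) signal."""

    def __init__(self):
        self.q = 0
        self.clock_toggles = 0  # assumed proxy for clock power

    def tick(self, clk, enable, d):
        gated_clk = clk & enable  # AND gate on the clock path
        if gated_clk:
            self.clock_toggles += 1  # the clock reaches the register
            self.q = d               # new data is captured
        return self.q


def simulate(enables, data):
    """Apply one clock pulse per step and report the final state."""
    reg = GatedRegister()
    for enable, d in zip(enables, data):
        reg.tick(1, enable, d)
    return reg.q, reg.clock_toggles
```

With `enables = [1, 0, 0, 1, 0]`, the register sees only two of the five clock pulses and holds its value during the idle cycles, which is the saving that clock gating aims for.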

## PROPOSED METHODOLOGY

Every present-day multimedia and general-purpose application demands a fast, ultra-low-power system. Most devices now run on battery power, and batteries are limited in both capacity and size. The most important part of any processing unit is the arithmetic logic unit (ALU). Because of its heavy arithmetic and logical workload, the ALU generally consumes a large share of the power of the complete system. For this reason we propose an approximate ALU, which is a combination of four ALUs:

1. Sixteen-bit accurate ALU
2. Sixteen-bit accurate ALU
3. Sixteen-bit semi-accurate ALU
4. Sixteen-bit approximate ALU
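The four-block organisation listed above can be illustrated with a sketch that splits 64-bit operands into four 16-bit slices and adds them slice by slice. All slices are exact in this sketch, because the assignment of operand bits to the accurate, semi-accurate, and approximate blocks is not detailed here; the function names and the carry chain between slices are assumptions made for the example.

```python
SLICE_BITS = 16
SLICE_MASK = (1 << SLICE_BITS) - 1


def split_slices(value):
    """Split a 64-bit value into four 16-bit slices, least significant first."""
    return [(value >> (SLICE_BITS * i)) & SLICE_MASK for i in range(4)]


def join_slices(slices):
    """Reassemble four 16-bit slices into one 64-bit value."""
    result = 0
    for i, part in enumerate(slices):
        result |= (part & SLICE_MASK) << (SLICE_BITS * i)
    return result


def add64_sliced(a, b):
    """64-bit addition built from four 16-bit slice additions.

    Returns the 64-bit sum and the final carry-out.
    """
    carry = 0
    out = []
    for slice_a, slice_b in zip(split_slices(a), split_slices(b)):
        total = slice_a + slice_b + carry
        out.append(total & SLICE_MASK)
        carry = total >> SLICE_BITS  # carry into the next slice
    return join_slices(out), carry
```

In this arrangement each slice could be replaced independently by a cheaper block, as long as the carry interface between neighbouring slices is preserved.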




The proposed approach also uses clock gating to reduce clock power and dynamic power. The proposed design supports a total of 16 instructions, which are as follows:

| Arithmetic operations | Logical operations  |
|-----------------------|---------------------|
| 1. Addition           | 1. AND              |
| 2. Subtraction        | 2. OR               |
| 3. Multiplication     | 3. NOR              |
| 4. Division           | 4. NAND             |
| 5. Square             | 5. XOR              |
| 6. Modulus            | 6. XNOR             |
|                       | 7. 1's complement   |
|                       | 8. 2's complement   |
|                       | 9. Right shift      |
|                       | 10. Left shift      |
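Since the design has sixteen operations, a 4-bit opcode is enough to select among them. The sketch below shows one possible selection scheme as a lookup table. The opcode values, the single-position shifts, and the truncation of every result to 64 bits are assumptions made for this illustration and are not taken from the proposed hardware.

```python
MASK = (1 << 64) - 1  # results are truncated to 64 bits (assumed)

# Hypothetical 4-bit opcode assignment for the sixteen operations.
OPS = {
    0x0: lambda a, b: a + b,     # addition
    0x1: lambda a, b: a - b,     # subtraction
    0x2: lambda a, b: a * b,     # multiplication
    0x3: lambda a, b: a // b,    # division (b must be non-zero)
    0x4: lambda a, b: a * a,     # square
    0x5: lambda a, b: a % b,     # modulus (b must be non-zero)
    0x6: lambda a, b: a & b,     # AND
    0x7: lambda a, b: a | b,     # OR
    0x8: lambda a, b: ~(a | b),  # NOR
    0x9: lambda a, b: ~(a & b),  # NAND
    0xA: lambda a, b: a ^ b,     # XOR
    0xB: lambda a, b: ~(a ^ b),  # XNOR
    0xC: lambda a, b: ~a,        # 1's complement
    0xD: lambda a, b: -a,        # 2's complement
    0xE: lambda a, b: a >> 1,    # right shift by one position (assumed)
    0xF: lambda a, b: a << 1,    # left shift by one position (assumed)
}


def alu64(opcode, a, b=0):
    """Apply the selected operation to unsigned 64-bit operands."""
    return OPS[opcode](a & MASK, b & MASK) & MASK
```

The final mask maps negative intermediate values, such as the result of a NOR or a subtraction that borrows, onto their 64-bit two's-complement bit patterns.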

## PROPOSED ARCHITECTURE



The proposed design uses four sixteen-bit ALUs: two are accurate, one is semi-accurate, and one is approximate. An accurate ALU works like a regular ALU architecture, that is, it produces 100% accurate output. For the semi-accurate ALU we propose a new architecture in which all logical operations are 100% accurate but arithmetic operations have an accuracy between 95% and 99%. For the purely approximate ALU we propose approximate logical operations, such as 2's complement, and arithmetic operations that are 80% to 90% accurate.
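To make the notion of an accuracy percentage concrete, the sketch below measures the mean accuracy of a 16-bit approximate adder over random inputs. The approximation used here (the lowest `k` bits are ORed instead of added) is a simple stand-in chosen for illustration and is not the circuit proposed in this work; the accuracy metric, the function names, and the sample count are likewise assumptions.

```python
import random


def approx_add16(a, b, k=4):
    """16-bit adder whose lowest k bits are ORed instead of added.

    Illustrative approximation only, not the proposed circuit.
    """
    low_mask = (1 << k) - 1
    low = (a | b) & low_mask            # carry-free lower part
    high = ((a >> k) + (b >> k)) << k   # exact upper part
    return (high | low) & 0x1FFFF       # 16-bit sum plus carry-out bit


def mean_accuracy(samples=10000, k=4, seed=0):
    """Average of 1 - |exact - approx| / exact over random inputs, in percent."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(samples):
        a = rng.randrange(1, 1 << 16)
        b = rng.randrange(1, 1 << 16)
        exact = a + b
        total += 1.0 - abs(exact - approx_add16(a, b, k)) / exact
    return 100.0 * total / samples
```

Increasing `k` makes the adder cheaper and less accurate, which is the kind of trade-off that separates the semi-accurate block from the approximate block.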




#### **RESULT**

This section presents a comparative analysis of the proposed ALU architecture and the existing ALU architecture in terms of power, area (LUTs), delay, and frequency. The FPGA comparison of the proposed and accurate designs is shown below; the hardware analysis is done on an Artix-7 FPGA, which is based on 28 nm technology.


### Simulation output



#### **CONCLUSION**

In this thesis we develop a system based on a 64-bit arithmetic logic unit for general-purpose and multimedia applications. As discussed earlier, existing ALU designs have many issues, and this thesis tries to resolve some of them. We present a novel approach to 64-bit ALU architecture that combines four sixteen-bit sub-ALU blocks. One block is 100% accurate, and the remaining three are 98%, 95%, and 90% accurate: the second block is semi-accurate, the third is semi-approximate, and the fourth is purely approximate. All hardware implementation is done in Xilinx 14.2 and design simulation is done in ModelSim. The target device is the Artix-7, an FPGA based on 28 nm technology. The simulation results show a 116% improvement in delay and frequency, a 48% improvement in power, and a 50% improvement in area compared with the existing approach.

#### REFERENCES

- [1] Booth, "A signed binary multiplication technique," The Quarterly Journal of Mechanics and Applied, vol. 4, no. 2, pp. 236-240, 1951.
- [2] C. S. Wallace, "A suggestion for a fast multiplier," Electronic Computers, IEEE Transactions on, vol. EC–13, no. 1, pp. 14–17, 1964.
- [3] K. S. Hemmert and K. D. Underwood, "Fast, Efficient Floating-Point Adders and Multipliers for FPGAs," Technology, vol. 3, no. 3, 2010.
- [4] S. Anderson and J. Earle, "The IBM system/360 model 91:floating-point execution unit," IBM Journal of, no. January, 1967.
- [5] P. Seidel and G. Even, "Delay-optimized implementation of IEEE floating-point addition," IEEE Transactions on Computers, vol. 53, no. 2, pp. 97-113, Feb. 2004.
- [6] R. M. Jessani and M. Putrino, "Comparison of single- and dual-pass multiply-add fused floating-point units," IEEE Transactions on Computers, vol. 47, no. 9, pp. 927-937, 1998.
- [7] division algorithms for an x86 microprocessor with a rectangular multiplier," in 2007 25th International Conference on Computer Design, 2007, pp. 304-310.


- [8] G. Even and P.-M. Seidel, "A comparison of three rounding algorithms for IEEE floating-point multiplication," IEEE Transactions on Computers, vol. 49, no. 7, pp. 638-650, Jul. 2000.


- [9] G. A. Constantinides, P. Y. K. Cheung, and W. Luk, "Wordlength optimization for linear digital signal processing," IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, vol. 22, no. 10, pp. 1432-1442, Oct. 2003.
- [10] G. A. Constantinides, P. Y. K. Cheung, and W. Luk, "Optimum and heuristic synthesis of multiple word-length architectures," IEEE Transactions on Very Large Scale Integration (VLSI) Systems, vol. 13, no. 1, pp. 39-57, Jan. 2005.
- [11] D.-U. Lee, A. A. Gaffar, R. C. C. Cheung, O. Mencer, W. Luk, and G. A. Constantinides, "Accuracy-Guaranteed Bit-Width Optimization," IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, vol. 25, no. 10, pp. 1990-2000, Oct. 2006.
- [12] X. Wang, "VFloat: A Variable Precision Fixed- and Floating-Point Library for Reconfigurable Hardware," ACM Transactions on Reconfigurable Technology, vol. 3, no. 3, pp. 1-34, 2010.
- [13] D. Tan, A. Danysh, and M. Liebelt, "Multiple-precision fixed point vector multiply-accumulator using shared segmentation," in 16th IEEE Symposium on Computer Arithmetic, 2003. Proceedings., 2003, vol. 00, no. C, pp. 12-19.
- [14] A. Akkas, "Dual-mode floating-point adder architectures," Journal of Systems Architecture, vol. 54, no. 12, pp. 1129-1142, Dec. 2008.
- [15] A. Akkas and M. Schulte, "Dual-mode floating-point multiplier architectures with parallel operations," Journal of Systems Architecture, vol. 52, no. 10, pp. 549-562, Oct. 2006.
- [16] A. Isseven and A. Akkas, "A Dual-Mode Quadruple Precision Floating-Point Divider," in Signals, Systems and Computers, 2006. ACSSC
- [17] L. Huang, S. Ma, L. Shen, Z. Wang, and N. Xiao, "Low Cost Binary128 Floating-Point FMA Unit Design with SIMD Support," IEEE Transactions on Computers, vol. PP, no. 99, pp. 1-8, 2011.
- [18] Gupta, A., U. Malviya, and Vinod Kapse. "Design of speed, energy and power efficient reversible logic based vedic ALU for digital processors." Engineering (NUiCONE), 2012 Nirma University International Conference on. IEEE, 2012.
- [19] Gupte, Ruchir, et al. "Pipelined alu for signal processing to implement interval arithmetic." Signal Processing Systems Design and Implementation, 2006. SIPS'06. IEEE Workshop on. IEEE, 2006.
- [20] Divic, Josip, and Marino Debeljuh. "Model of 8-bit microprocessor intended for lecturing." MIPRO, 2012 Proceedings of the 35th International Convention. IEEE, 2012.
- [21] Zhou, Yu, and Hui Guo. "Application specific low power ALU design." Embedded and Ubiquitous Computing, 2008. EUC'08. IEEE/IFIP International Conference on. Vol. 1. IEEE, 2008.
- [22] Liang, Getao, JunKyu Lee, and Gregory D. Peterson. "ALU Architecture with Dynamic Precision Support." Application Accelerators in High Performance Computing (SAAHPC), 2012 Symposium on. IEEE, 2012.
- [23] Kyaw, Khaing Yin, Wang Ling Goh, and Kiat Seng Yeo. "Low-power high-speed multiplier for error-tolerant application." 2010 IEEE International Conference of Electron Devices and Solid-State Circuits (EDSSC). 2010.
- [24] Lee, Kyoungwoo, et al. "Error-exploiting video encoder to extend energy/qos tradeoffs for mobile embedded systems." *Distributed Embedded Systems: Design, Middleware and Resources.* Springer US, 2008. 23-34.
- [25] Kathuria, Jagrit, M. Ayoubkhan, and Arti Noor. "A review of clock gating techniques." *MIT International Journal of Electronics and Communication Engineering* 1.2 (2011): 106-114.